Wildfires in the US

Statistical computation and visualization (MATH-517)

Zineb AGNAOU https://github.com/ZinebAg , Fahim BECK https://github.com/FahimBeck , Salima JAOUA https://github.com/salimajaoua , Matias JANVIN https://github.com/matiasjanvin , Seorim PARK https://github.com/seorimpark
November 16, 2021

Introduction

Research questions

Approaches

Sources of information / datasets

Transformations to talk :

Packages

Here are some packages used in our project. Make sure they are installed before running the R Markdown file. Since the format of this Markdown is not standard, you need to install the Distill package. To avoid errors, install it via Github (the package is updated there) with the following commands:
install.packages("devtools")
devtools::install_github("rstudio/distill")

Exploratory Data Analysis

add reference to the introduction and mention it is majorly inspired by the one present on the website:

Wildfires are uncontrolled fires of fuel materials composed of mostly natural vegetation, such as woods or shrublands. They represent an environmental threat with major impacts all over the world, and their frequency of outbreak and severity are expected to increase with global warming Jones et al. 2020. Each year, wildfires cause many direct fatalities in humans, as well as extreme air pollution events and loss of both biodiversity and other valuable ecosystem services. They also contribute a substantial fraction of global greenhouse gas emissions each year, so they can further worsen climate change. The modeling of wildfires has been approached using a wide variety of statistical approaches (see, e.g., Preisler et al. 2004, Pereira & Turkman 2019, Xi et al. 2019), some of which employ extreme value theory (see, e.g., Pereira & Turkman 2019).

As for the origin of wildfire starts, lightning is the main natural cause, but in the overwhelming majority of cases, human activities are responsible. They can be intentional (arson, children) or accidental (burning debris, agricultural activities, campfires, smoking). In general, wildfires are the result of a combination of the presence of combustible materials (e.g., a forest), their easy flammability (e.g., as a result of extreme weather conditions such as droughts), and a triggering event (e.g., lightning or human activity). In most fire-prone regions of the world, wildland fire activity has seasonal cycles due to the seasonality of favorable weather conditions.

To assist in wildland fire management, it is essential to understand and predict the risk factors that contribute to wildland fires and their spatial and seasonal distribution. Wildland fire management involves a multitude of tasks, including monitoring forest ecosystems, deploying preventative measures, firefighting logistics, and short-term forecasting and long-term projections of wildland fire activity.

With the chosen data set, we can concentrate on two important components of wildland fire activeness: fire occurrence and fire magnitude. Given a region of space and time, we consider the number of spatially separated wildfire events as an observation with respect to the first aspect (occurrence), and the aggregate burned area of wildfires originating from the area of interest as an observation with respect to the second aspect (size).

The dataset contains 563,983 rows for 37 columns. The columns are the following:

Please note that the area proportions lc1 to lc18 do not always sum to exactly 1 for each pixel and month since a few classes with quasi-0 proportion have been removed.

Since the original data was given under the context of a prediction competition with the university of Edimburgh, there is a 8,000 of missing values in each of the columns CNT and BA. The missing values are not located necessarly in the same lines for the two features.

Visualisation and interactive / animated maps

Wildfires over time

Wildfires and the land cover:

Corrolation between Land covers and Wildfires:

We want to determine if there is a significant correlation between the different land covers and the appearance of fires. In order to do so, we use a database filtered of missing data. As can be seen in Figure (?), the correlation between the fires remains very low. To confirm this statement, we plot in Figure(?) the variables that have been defined as highly correlated in red and the rest in light blue. We can then see that there is no strong correlation between the columns nor between the columns and the apparition of the fires. We define a correlation as strong when its absolute value exceeds 0.8.

Landcover over time and location:

In this section, we will study the distribution of land covers over time. To do so, we have selected the geographical coordinates of Malibu, California. This area has been heavily impacted by wildfires in recent years. In November 2018, the wealthy coastal enclave of Malibu was engulfed by the Woolsey Fire, which spread to over 96,000 acres of land outside of LA and is now 35% contained. At least two people were pronounced dead in Malibu on Friday. Their burned bodies were found in a car near Mulholland Highway. In Figure (), we can see the exact location of the studied point on the map of the United States.

In Figure (), we can see the distribution of the landcovers with time. We can see that the proportions are relatively stable and consistent with time. The predominant cover is the urban area with 31% of the surface in 1993 this proportion only increases with time.

Although in this particular example of Malibu, the proportion of different landcovers seems consistent, this is not always the case. Taking the coordinates in Figure () in Wisconsin, we can see that the proportions change drastically depending on the year. The percentage of land representing tree broadleaved evergreen closed to open increases with time until it becomes the predominant one, while shrub or herbaceous cover flooded decreases drastically with time as seen in Figure(). We will not analyze the reasons and parameters that may have motivated this change, but we will try to quantify this change in the next section.

To access the intarctive part, click this link. In this app you will able to select the longitude and lattitude wanted and see the distribution of the landcovers.

Predominant Land covers Analysis and Shifts:

We will first proceed by representing the territory according to the predominant type of coverage. Figure () shows this arrangement in the year 2000. We observe that water dominates the maritime borders, tree broadleaved are predominant in the east of the country while the west is dominated by tree needleleave. We can easily identify the cities of New York, Los angeles and other large cities by their predominance of urban space.

To access the intarctive part, click this link. In this app you will able to select the year and the predominant landcover will display directly on the us map.

To analyze the variation, we create a vector whose values start at zero for any given location and increases by one as soon as the predominant coverage area changes from one month to the next. What is first interesting to note is that all locations change their predominant type of space at least once. Indeed in Figure (), we see that the shift values start at 2. We also see that the shifts do not exceed 10, knowing that this measurement is taken over 22 years, this is still relatively low.

Land covers and meteorogical correlation

We want to understand the correlation between land cover and meteorological variables. To do so, we are going to clean our data first, rename the variables by their description so that we can easily analyze our results and then compute the correlation between the variables. The problem is that when trying to output the matrix, we get a warning saying that the matrix is too big. To get an idea of what variable can be highly related we are going to output the correlation between a pair of land cover and meteorological variable when it is bigger than a threshold manually set.

[1] "Important correlations for the land cover variable number 1:  0"
[1] "Important correlations for the land cover variable number 2:  0"
[1] "Important correlations for the land cover variable number 3:  0"
[1] "Important correlations for the land cover variable number 4:  0"
[1] "Important correlations for the land cover variable number 5:  0"
[1] "Important correlations for the land cover variable number 6:  0"
[1] "Important correlations for the land cover variable number 7:  0"
[1] "Important correlations for the land cover variable number 8:  0"
[1] "Important correlations for the land cover variable number 9:  0"
[1] "Important correlations for the land cover variable number 10:  0"
[1] "Important correlations for the land cover variable number 11:  0"
[2] "Important correlations for the land cover variable number 11:  7"
[1] "Important correlations for the land cover variable number 12:  0"
[1] "Important correlations for the land cover variable number 13:  0"
[1] "Important correlations for the land cover variable number 14:  0"
[1] "Important correlations for the land cover variable number 15:  0"
[1] "Important correlations for the land cover variable number 16:  0"
[1] "Important correlations for the land cover variable number 17:  0"
[1] "Important correlations for the land cover variable number 18:  0"

Here, we choose a threshold of 0.6 and none of the correlations has been outputted. This means that all of them are less that 0.6. In fact, when printing all the correlations (threshold = 0) we remark that the values are very small. This can come from the fact that there are no correlations between the variables or that the correlation is not linear here. So we are going to compute the correlation with spearman method.

[1] "Important correlations for the land cover variable number 1:  0"
[1] "Important correlations for the land cover variable number 2:  0"
[1] "Important correlations for the land cover variable number 3:  0"
[1] "Important correlations for the land cover variable number 4:  0"
[1] "Important correlations for the land cover variable number 5:  0"
[1] "Important correlations for the land cover variable number 6:  0"
[1] "Important correlations for the land cover variable number 7:  0"
[1] "Important correlations for the land cover variable number 8:  0"
[1] "Important correlations for the land cover variable number 9:  0"
[1] "Important correlations for the land cover variable number 10:  0"
[1] "Important correlations for the land cover variable number 11:  0"
[2] "Important correlations for the land cover variable number 11:  7"
[1] "Important correlations for the land cover variable number 12:  0"
[1] "Important correlations for the land cover variable number 13:  0"
[1] "Important correlations for the land cover variable number 14:  0"
[1] "Important correlations for the land cover variable number 15:  0"
[1] "Important correlations for the land cover variable number 16:  0"
[2] "Important correlations for the land cover variable number 16:  8"
[1] "Important correlations for the land cover variable number 17:  0"
[1] "Important correlations for the land cover variable number 18:  0"
[2] "Important correlations for the land cover variable number 18:  8"

Now by changing the method, we observe that we get that only the land cover variable 11 (shrubland) and the surface net thermal radiation are correlated. But the results are not as we expected, so we are going to regroup land covers that are similar.

Spreading fire based on the type of land cover

First, to analyze the spread of a fire, we are going to look for the distribution of the number of fires in a grid during a month.

We remark that there most of the values next to 0 and because of the outliers we can’t see the distribution very good. So we are going to filter the outliers to have a better view of the distrubution

Now, since we want to predict the spread of a fire given the land covers data, let’s plot the distribution of the number of fire according to each land cover variable.

In this section, we try to model the fact that there was a fire in a grid for a given month given the land cover specification of the grid. The problem is that we have the number of fire in the grid. One can think of using the zero inflated Poisson regression. In fact, the zero inflated Poisson is used to count data with excess zeros and overdispersion, which describe well our data. It combines the Poisson distribution and the logit distribution.


Call:
zeroinfl(formula = CNT ~ lc1 + lc2 + lc3 + lc4 + lc5 + lc6 + 
    lc7 + lc8 + lc9 + lc10 + lc11 + lc12 + lc13 + lc14 + lc15 + 
    lc16 + lc17 + lc18, data = df)

Pearson residuals:
     Min       1Q   Median       3Q      Max 
 -2.6002  -0.6556  -0.4760  -0.1269 164.8916 

Count model coefficients (poisson with log link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.4275     0.1150  -3.718 0.000201 ***
lc1           3.8936     0.1317  29.561  < 2e-16 ***
lc2           1.7546     0.1152  15.234  < 2e-16 ***
lc3           0.1621     0.1427   1.136 0.255880    
lc4           2.5901     0.1170  22.144  < 2e-16 ***
lc5           2.3983     0.1148  20.897  < 2e-16 ***
lc6          -7.7769     0.4706 -16.526  < 2e-16 ***
lc7           2.6995     0.1149  23.490  < 2e-16 ***
lc8           2.3755     0.1158  20.514  < 2e-16 ***
lc9           1.0370     0.1172   8.845  < 2e-16 ***
lc10          4.5284     0.1232  36.762  < 2e-16 ***
lc11          1.5169     0.1153  13.158  < 2e-16 ***
lc12          1.9933     0.1155  17.256  < 2e-16 ***
lc13         -1.5373     0.1658  -9.271  < 2e-16 ***
lc14          2.7590     0.1212  22.764  < 2e-16 ***
lc15          2.5966     0.1282  20.260  < 2e-16 ***
lc16          5.8753     0.1176  49.973  < 2e-16 ***
lc17         -0.3846     0.1310  -2.937 0.003314 ** 
lc18          1.4079     0.1162  12.121  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -9.8587     0.4163  -23.68  < 2e-16 ***
lc1           9.8726     0.4567   21.62  < 2e-16 ***
lc2          11.9956     0.4168   28.78  < 2e-16 ***
lc3          12.7570     0.4947   25.79  < 2e-16 ***
lc4           9.9462     0.4243   23.44  < 2e-16 ***
lc5          10.0305     0.4162   24.10  < 2e-16 ***
lc6          18.0285     1.0028   17.98  < 2e-16 ***
lc7           8.0604     0.4164   19.36  < 2e-16 ***
lc8           9.4954     0.4201   22.60  < 2e-16 ***
lc9          11.0770     0.4201   26.37  < 2e-16 ***
lc10          1.2508     0.4685    2.67  0.00759 ** 
lc11         10.7596     0.4172   25.79  < 2e-16 ***
lc12         10.8828     0.4172   26.08  < 2e-16 ***
lc13         16.6698     0.5593   29.81  < 2e-16 ***
lc14          8.6888     0.4388   19.80  < 2e-16 ***
lc15          9.2563     0.4608   20.09  < 2e-16 ***
lc16          7.3972     0.4386   16.87  < 2e-16 ***
lc17         12.1465     0.4312   28.17  < 2e-16 ***
lc18         11.4647     0.4180   27.43  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 43 
Log-likelihood: -1.138e+06 on 38 Df

Wildfires and meteorogical factors

In this section, we try to find some meteorological factors that could potentially trigger a wildfire. To do so, we will plot a Pearson correlation heatmap. The first step is to get a subset from our initial dataset that is more representative of wildfire conditions. As a fire is quite a rare event, we will not be able to find any links if we were considering the entire population (the correlations would be around 0).

To get our subset, we will only consider areas per month which have at least a certain number of wildfires and which have been burnt above a certain threshold. For our example, we take rows with more than 60 wildfires (\(CNT \geq 60\)) and 5000 acres of aggregated burnt area (\(BA \geq 5000\), approximately 20km2) in the current month.

We check whether the period in years of the subset matches that of the initial dataset, i.e. 1993 to 2015. We see that it does.

range(selection$year)
[1] 1993 2015

As a result, a correlation heatmap will allow us to distinguish certain factors that could favor a wildfire.

Since we are looking for meteorological risk factors, we may only consider the first two rows or columns of the heatmap.

First, we see that there are no pairs of fully correlated variables. Indeed, the highest value (absolute) is 0.43. The number of wildfires and months are negatively correlated. This can be explained by the fact that months range from March to September and that there are more occurrences of fire in March and April than in August and September. The fact that the numbers are small shows that the causes of a wildfire are meteorologically multifactorial.

An interesting observation is that \(CNT\) and \(BA\) are negatively correlated, meaning that the more area burnt, the fewer wildfires there are. Most of the time, the signs of correlations in the first two rows are not the same for these two variables. For example, by looking at temperatures and solar radiation, the higher they are, the more area burnt, but the less wildfires. This can be explained by the fact that our subset contains substantial wildfires, so there aren’t many, but they are destructive.

Another variable that we can comment on is precipitation. For an area to be burnt, there must be no rain or high humidity conditions, hence the negative correlations between \(BA\) and precipitation. The positive correlation with \(CNT\) could be explained by unstable weather conditions. As with wind speed, the higher it is, the more unstable the air masses and the greater the risk of a natural disaster.

But we should be careful with these correlations as we used a subset and as their values are not that high in absolute value.

Next, a map of the US represents the areas considered in the subset, regardless of months and years.

It can be seen that the areas represented are not that many in number and are located where the forest cover is the most important. For example, there are almost no red squares in the central region of the United States, as this area is mostly non-forest land.

For other thresholds, an interactive app is available here. Instructions and explanations are displayed on the app. Basically, you have the same results as above, but thresholds for \(CNT\) and \(BA\) are entered by the user.

Discussion and conclusion

Future improvements